bench(v1.5.4): suite redesign — prune saturation, fair h2c, deeper drivers, WS/SSE competitors by FumingPower3925 · Pull Request #199 · goceleris/probatorium

FumingPower3925 · 2026-06-21T18:16:02Z

v1.5.4 benchmark suite redesign (Part A)

Sharpens the 50-adapter × scenario grid for a thesis-defense audience: cut compute waste, deepen where signal lives, and make every comparison defensible. Grounded throughout in the live v1.5.3 results data.

W1 — Prune NIC-saturated / redundant cells (`f968b09`)

Removed the static rows whose RPS is fabric-bound on the 20 Gbps link (RPS converges → not a ranking signal): get-json-8k/16k/64k, post-8k/16k/64k, and the two large-body h2 rows (get-json-64k-h2, post-64k-h2). Static H1 12→6, H2 4→2. Kept post-1m as the single documented wire-bound datapoint for the methodology section, and the full concurrency sweep (the io_uring crossover evidence).

W3 — Driver depth 4 → 10 (`67751c4`, `682d853`)

Turned the single-GET driver story into a multi-dimensional one by adding six scenarios: driver-pg-write (INSERT), driver-pg-update-tx (BEGIN/UPDATE/COMMIT), driver-pg-read-range (N-row SELECT), driver-redis-set, driver-redis-pipeline (the strongest native-vs-go-redis differentiator), and driver-mc-multiget. Implemented across all 9 driver adapters (celeris native .Async() + idiomatic pgx/go-redis/gomemcache). Adds an unlogged bench_writes fixture table; also fixes a latent env-var split (PROBATORIUM_MC_ADDR → PROBATORIUM_MEMCACHED_ADDR) that was 503-ing every memcached cell on the six adapters that read PROBATORIUM_MEMCACHED_ADDR, because validate.yml only set PROBATORIUM_MC_ADDR.

W4 — WS/SSE competitors (`1137fdf`)

Native WebSocket + SSE for axum (Rust), hono (Bun), starlette (Python) matching the fixed wire contract (/ws ?mode=ws-echo|ws-large-echo|ws-hub, /events text/event-stream, 1 ms publish). Each was build- and runtime-smoke-verified locally (101 upgrade, echo round-trip, hub broadcast, SSE frame, large-echo). Flipped Capabilities{WS,SSE} on the three h1 columns.

W2 — h2c fair-fight, by measurement (`3f9dacc`)

Audited the advertised SETTINGS_INITIAL_WINDOW_SIZE of every h2c column with a wire probe (loadgen adopts the server's advertised window as its per-stream upload cap, so a small window throttles POSTs independent of server speed). Result refutes the blanket-artifact premise: the entire Go field (incl. hertz) and both Rust columns already advertise celeris's 1 MiB window. Only Kestrel (768 KiB) and the Bun node:http2 columns (64 KiB) lagged — now equalized to 1 MiB. Disclosed caveat: the two hypercorn Python columns expose no per-stream window knob (they're the slowest columns regardless). Methodology recorded as a code comment above the h2c columns.

W5 — Budget reconcile (`c9e8715`)

Live cmd/runner -dry-run resolves 1111 capability-gated cells/pass (was pinned 1257/820 across two drifted constants). Updated both; retuned the headline window (60s/15s→40s/12s) so the grown grid still fits 24 h. All budget invariants pass with the true count.

Verification

go build ./... + go test ./... green. Rust/Bun/Python/C# adapters built and the patched h2c columns re-probed from rebuilt artifacts (all advertise 1 MiB). The macro gate is the next cluster bench run, which produces the v1.5.4 numbers.

Remove the rows whose RPS converges at the 20G fabric line rate (the harness already flags them network_bound): get-json-8k/16k/64k + post-8k/16k/64k (H1) and get-json-64k-h2 + post-64k-h2. These burned ~288 cells/run (~5h) without differentiating fast adapters. Keep post-1m as the SINGLE documented wire-bound datapoint for the saturation/methodology discussion (not a ranking row). Drop the now-unused post8k/16k/64k payload generators; update the registry/category/body-size test guards. Net static rows: 12 H1 -> 6, 4 H2 -> 2. Funds the driver + WS/SSE depth.
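The convergence argument can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: it assumes the "8k/16k/64k/1m" labels mean binary sizes, ignores header and TCP/IP framing overhead, and uses a hypothetical `frac` threshold that is not the harness's actual network_bound rule.

```go
package main

import "fmt"

// lineRateRPS returns the request rate at which response bodies of
// bodyBytes alone would fill a link of linkBitsPerSec. Headers and
// TCP/IP framing are ignored, so the real ceiling is somewhat lower.
func lineRateRPS(linkBitsPerSec float64, bodyBytes int) float64 {
	return linkBitsPerSec / 8 / float64(bodyBytes)
}

// wireBound reports whether a measured RPS sits at or above frac of the
// body-only ceiling. frac is a hypothetical threshold for illustration.
func wireBound(rps, linkBitsPerSec float64, bodyBytes int, frac float64) bool {
	return rps >= frac*lineRateRPS(linkBitsPerSec, bodyBytes)
}

func main() {
	const link = 20e9 // 20 Gbps fabric, as stated in the commit message
	for _, size := range []int{8 << 10, 16 << 10, 64 << 10, 1 << 20} {
		fmt.Printf("%8d B body: ceiling ~%.0f RPS\n", size, lineRateRPS(link, size))
	}
}
```

Once every fast adapter reaches that ceiling for a given body size, the row measures the link rather than the framework, which is the reason given for pruning it.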

…(W3 part 1)

Define the 6 new driver-depth scenarios (writes / transaction / range / pipeline / multiget) + their routes/bodies, and add the unlogged bench_writes PG table + FixtureRedisWriteKey to services. Verified: scenarios build/vet + the registry/category tests pass with 10 driver scenarios.

NEXT (W3 part 2): implement the 6 handlers across the 9 driver adapters (servers/*/driver_handlers.go): celeris via its native driver/{postgres,redis,memcached} (Pipeline/BeginTx/GetMulti all confirmed present), Go competitors via idiomatic pgx/go-redis/gomemcache; standardize the memcached env var; add conformance routes. Until then these rows 404 (feature branch only; not runnable yet).

…pters (W3 part 2)

Add pg-write / pg-update-tx (BEGIN-UPDATE-COMMIT) / pg-read-range (N-row) / redis-set / redis-pipeline (batched GETs) / mc-multiget to every driver adapter: celeris via its native driver/{postgres,redis,memcached} (Pipeline / BeginTx / GetMulti), the 8 Go competitors via idiomatic pgx/go-redis/gomemcache, each matching the adapter's framework idiom.

Routes are /cache-pipeline and /mc-multiget (not /cache/pipeline, /mc/multi) to avoid colliding with the /cache/:key and /mc/:key param routes.

Also standardize the memcached env var: fasthttp/echo/iris read PROBATORIUM_MC_ADDR while the other 6 read PROBATORIUM_MEMCACHED_ADDR. This was a latent bug: validate.yml only set MC_ADDR, so the 6 MEMCACHED_ADDR adapters 503'd on every memcached cell. All read MEMCACHED_ADDR now; validate.yml fixed; run_bench_cell.yml dual-set collapsed.

All 9 adapter modules build; scenarios test green (10 driver scenarios).

Widen the WS/SSE grid beyond celeris+gorilla. Each adapter serves the fixed wire contract (GET /ws ?mode=ws-echo|ws-large-echo|ws-hub, GET /events text/event-stream, 1ms publish, "payload"/"hello"):

- axum (Rust): axum::extract::ws + response::sse; a single broadcast tick drives both hub fan-out and SSE. serve_h1 gains .with_upgrades() (mandatory: without it hyper writes 101 then drops the socket); h2c serve path untouched.
- hono (Bun): Bun.serve native websocket + SSE ReadableStream on the h1 branch; one 1ms ticker, drop-on-backpressure fan-out.
- starlette (Python): WebSocketRoute + StreamingResponse; the two 1ms asyncio tickers start per worker via a lifespan; WS rides uvicorn[standard]'s bundled websockets (no new dep).

Flip Capabilities{WS,SSE} on the three h1 columns. featureSetFor already projects them and streaming gates on fs.HTTP1, so the -h2 siblings stay out of the streaming grid.

All three were build- and runtime-smoke-verified locally (101 upgrade, echo round-trip, hub broadcast, SSE frame, 256 KiB large-echo).
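The "one 1ms ticker, drop-on-backpressure fan-out" pattern can be sketched in Go as follows. This is an illustration of the technique, not a port of any of the three adapters; the `hub` type and its methods are hypothetical names. The key point is the non-blocking send: a subscriber whose buffer is full misses the tick instead of stalling the publisher.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hub fans one message out to many subscribers without letting a slow
// subscriber block the publisher.
type hub struct {
	mu      sync.Mutex
	subs    map[chan string]struct{}
	dropped int // total sends skipped because a buffer was full
}

func newHub() *hub { return &hub{subs: make(map[chan string]struct{})} }

// subscribe registers a buffered channel of the given capacity.
func (h *hub) subscribe(buf int) chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// publish attempts a non-blocking send to every subscriber and returns
// how many subscribers were skipped on this tick.
func (h *hub) publish(msg string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	skipped := 0
	for ch := range h.subs {
		select {
		case ch <- msg:
		default: // buffer full: drop instead of waiting
			skipped++
		}
	}
	h.dropped += skipped
	return skipped
}

func main() {
	h := newHub()
	fast := h.subscribe(16)
	_ = h.subscribe(1) // a subscriber that never drains its buffer

	tick := time.NewTicker(time.Millisecond) // the 1 ms publish tick
	defer tick.Stop()
	for i := 0; i < 5; i++ {
		<-tick.C
		h.publish("payload")
	}
	fmt.Println("fast subscriber queued:", len(fast), "dropped:", h.dropped)
}
```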

… (W2)

Audit + fix the h2c "fair fight". loadgen adopts the server's advertised SETTINGS_INITIAL_WINDOW_SIZE as its per-stream upload window, so a small advertised window throttles POST throughput independent of server speed. Measured the advertised SETTINGS of every h2c column with a wire probe:

| h2c column | Advertised window |
| --- | --- |
| celeris | 1 MiB window / 1 MiB frame / 100 streams |
| gin/echo/chi/iris/hertz/stdhttp (Go net/http) | 1 MiB (already fair) |
| axum-h2 / hyper-h2 (Rust hyper) | 1 MiB (already fair) |
| aspnet-h2 (Kestrel) | 768 KiB -> 1 MiB |
| hono-h2 / elysia-h2 (Bun node:http2) | 64 KiB -> 1 MiB |
| fastapi-h2 / starlette-h2 (hypercorn/h2) | 64 KiB (see caveat) |

So the original "h2c is mostly an artifact" premise is refuted by measurement: the entire Go field and both Rust columns already advertise celeris's 1 MiB window. Only Kestrel and the Bun columns lagged; their adapters now explicitly advertise the 1 MiB / 100-stream profile:

- servers/aspnet/Program.cs: Http2 InitialStream/ConnectionWindowSize = 1 MiB, MaxStreamsPerConnection = 100.
- servers/{hono,elysia}/src/h2c.ts: node:http2 settings.initialWindowSize = 1 MiB + maxConcurrentStreams 100, plus session.setLocalWindowSize to lift the connection window off its 64 KiB default.

DISCLOSED CAVEAT: fastapi-h2/starlette-h2 ride hypercorn, which exposes no per-stream initial-window knob; they keep the h2-library default and can't be equalized without a fragile monkeypatch. They are the slowest columns regardless, so the window is not their binding constraint.

Methodology recorded as a block comment above the h2c columns in servers/servers.go. All three patched columns re-probed from their rebuilt artifacts: each now advertises INITIAL_WINDOW=1048576.
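A wire probe of this kind needs very little code. The following is a minimal standard-library sketch under stated assumptions, not the probe actually used for the audit: it speaks prior-knowledge h2c, sends the client preface plus an empty SETTINGS frame, and reads frames until the server's first non-ACK SETTINGS. The function names are hypothetical. Per RFC 9113, a server that omits SETTINGS_INITIAL_WINDOW_SIZE (identifier 0x4) is using the default of 65535 bytes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

const (
	frameSettings        = 0x4   // HTTP/2 frame type SETTINGS
	flagAck              = 0x1   // ACK flag on a SETTINGS frame
	settingInitialWindow = 0x4   // SETTINGS_INITIAL_WINDOW_SIZE identifier
	defaultInitialWindow = 65535 // RFC 9113 default when the setting is absent
)

// initialWindow scans a SETTINGS payload (6-byte id/value pairs) and
// returns the advertised per-stream window, or the protocol default.
func initialWindow(payload []byte) (uint32, error) {
	if len(payload)%6 != 0 {
		return 0, errors.New("malformed SETTINGS payload")
	}
	win := uint32(defaultInitialWindow)
	for i := 0; i < len(payload); i += 6 {
		if binary.BigEndian.Uint16(payload[i:]) == settingInitialWindow {
			win = binary.BigEndian.Uint32(payload[i+2:]) // last value wins
		}
	}
	return win, nil
}

// probe connects with prior-knowledge h2c and returns the window from
// the server's first non-ACK SETTINGS frame.
func probe(addr string) (uint32, error) {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(2 * time.Second))

	// Client connection preface followed by an empty SETTINGS frame.
	hello := "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n" + "\x00\x00\x00\x04\x00\x00\x00\x00\x00"
	if _, err := conn.Write([]byte(hello)); err != nil {
		return 0, err
	}
	var hdr [9]byte
	for {
		if _, err := io.ReadFull(conn, hdr[:]); err != nil {
			return 0, err
		}
		n := int(hdr[0])<<16 | int(hdr[1])<<8 | int(hdr[2]) // 24-bit length
		payload := make([]byte, n)
		if _, err := io.ReadFull(conn, payload); err != nil {
			return 0, err
		}
		if hdr[3] == frameSettings && hdr[4]&flagAck == 0 {
			return initialWindow(payload)
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe host:port")
		return
	}
	win, err := probe(os.Args[1])
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("INITIAL_WINDOW =", win)
}
```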

The v1.5.4 redesign reshaped the grid (W1 pruned saturated static rows, W3 deepened drivers 4->10, W4 added WS/SSE to three columns). A live `cmd/runner -dry-run -cells '*/*'` now resolves 1111 capability-gated cells/pass (52 adapters x 44 scenarios), so the stale pins move:

- FastRealizedCells: 1257 -> 1111 (fast = 35s/10s window = 19.1h < 24h)
- FullRealizedCells: 820 -> 1111 (same realized "*/*" grid as Fast; the 820/1257 split was pre-existing drift)

HeadlineWeekly's per-cell window shortened 60s/15s -> 40s/12s: the longer window no longer fits the grown grid in 24h (1111 x 92s = 28.4h), the shorter one does (1111 x 69s + ~0.7h rated = ~22.0h < 24h).

All budget invariants (TestWeeklyConfigFitsBudget / TestFastFitsWithin24h) pass with the true count instead of a stale pin. Comment arithmetic updated to match.

…nify session-rw, pg synccommit, post-1m non-ranking

An adversarial pre-run suite audit found 4 run-invalidating issues. Fixed all:

1. MIDDLEWARE/CHAIN REMOVED (12 scenarios): the 4 chain stacks compared unequal middleware work across adapters in OPPOSITE directions (celeris 0 CORS headers + 0-work CSRF but 9 secure headers + full structured logger; competitors 3 CORS + 1 CSRF cookie + 4 secure + 2-token logger; echo/iris a third set). Any fullstack/security ranking off that was an artifact, not a finding. Owner decision: remove them (also saves ~276 cells of compute). scenarios/chain.go init() now registers nothing.
2. driver-session-rw UNIFIED across all 9 adapters: it ran 0-3 backend round-trips per adapter (iris used an in-memory store touching NO redis; gin INCR+EXPIRE=2; echo HSET+HINCRBY+EXPIRE=3; others 1 SET), which is not a framework comparison. Now every adapter does an identical GET+SET of the fixed key `pmsess:bench` (seeded in services.go) + a uniform 256B body merge = exactly 2 redis ops. celeris keeps its NATIVE driver, competitors go-redis (the intended native-vs-idiomatic split; only the OPS match). (Also fixed servers/stdhttp which was non-compiling on the branch.)
3. driver-pg-update-tx: postgres now runs `-c synchronous_commit=off` (services.go local + ansible bench.yml/validate.yml cluster) so the BEGIN;UPDATE;COMMIT measures driver/framework overhead, not host disk fsync rate (was ~1k RPS = disk, identical for everyone).
4. post-1m is now UNCONDITIONALLY excluded from the RPS ranking (report/document.go isWireBoundByDesign): its runtime network-bound guard no-ops on the Tailscale overlay (line rate unknown), so it would otherwise publish as a raw-RPS row.

Grid: 41->29 scenarios, 1111->835 cells/pass (~14.4h fast, single arch). Budget constants reconciled. Root + all 9 adapter modules build; budget + scenarios + report tests pass.

…ed guard

A second adversarial pre-run audit. Verified the headline-ranking findings only reach a rated/publish run (the saturation-only Fast matrix renders no LatencyAtSLO table), then fixed four real issues + a stale test:

1. HEADLINE RANKING now consults NetworkBound. writeLatencyAtSLOSection bolded a per-column leader for every scenario carrying rated data, including the wire-bound post-1m (the NetworkBound flag set in BuildDocument was never read by the ranking), the tick-bound fan-out cells (ws-hub-broadcast-*, sse-fanout-*, paced by the 1ms publish tick), and the single-conn latency probe get-json-1c (RPS == 1/latency). New isFanoutBound + isLatencyProbeByDesign predicates + a headlineRanked() gate drop them from the bolded headline (they stay in the detail/tail-latency/network-bound sections) and disclose the exclusion in a note. Added TestHeadlineExcludesNonCPUBoundScenarios.
2. RATED PIN reconciled: budget.RatedScenarios listed auto-mix-111, which is never registered, so the -cells filter matched nothing and the rated grid was 16 cells while Headline/FullRatedRealizedCells pinned 24. Removed the dead entry, re-pinned 24->16, added TestRatedRealizedCellsMatchSubset.
3. SEED GUARD: services.VerifySeed reads back a canonical fixture from each backend after seeding (users row count, redis/mc demo-key + session keys) and errors on the seed step if any is missing, turning a silent partial or mis-targeted seed (which would publish a no-work driver cell as a fast 200) into a loud failure before any cell runs. Wired into Seed + SeedExternal.
4. ENV: validation/refapp/driver_memcached read PROBATORIUM_MC_ADDR while ansible/validate.yml exports PROBATORIUM_MEMCACHED_ADDR (dead override, masked by a matching default); unified to PROBATORIUM_MEMCACHED_ADDR.

Also fixed the stale TestCellsGlobServersRespectsExcludes want list (the v1.5.4 celeris-column expansion grew celeris-* 4->9 engine modes; the test still hardcoded 4 and failed under -tags mage) and the stale 60s/15s headline-window docstring in mage_tier.go (now 40s/12s). Root + adapter modules build; report/budget/services tests + the -tags mage grid guards pass.
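The headline gate in item 1 amounts to a conjunction of exclusion predicates. The sketch below reuses the predicate names from the commit, but the signatures and bodies are assumptions reconstructed from the scenario names it quotes; the real report package may match on scenario metadata rather than on name strings.

```go
package main

import (
	"fmt"
	"strings"
)

// isWireBoundByDesign: post-1m is kept only as the documented
// wire-bound datapoint.
func isWireBoundByDesign(scenario string) bool { return scenario == "post-1m" }

// isFanoutBound: fan-out cells are paced by the 1 ms publish tick.
func isFanoutBound(scenario string) bool {
	return strings.HasPrefix(scenario, "ws-hub-broadcast-") ||
		strings.HasPrefix(scenario, "sse-fanout-")
}

// isLatencyProbeByDesign: on a single connection RPS is just 1/latency.
func isLatencyProbeByDesign(scenario string) bool { return scenario == "get-json-1c" }

// headlineRanked reports whether a scenario may receive a bolded leader.
// networkBound is the runtime flag; the predicates cover by-design cases.
func headlineRanked(scenario string, networkBound bool) bool {
	return !networkBound &&
		!isWireBoundByDesign(scenario) &&
		!isFanoutBound(scenario) &&
		!isLatencyProbeByDesign(scenario)
}

func main() {
	// The "-100" suffix is a made-up example of a fan-out cell name.
	for _, s := range []string{"driver-redis-pipeline", "post-1m", "ws-hub-broadcast-100", "get-json-1c"} {
		fmt.Printf("%-24s ranked=%v\n", s, headlineRanked(s, false))
	}
}
```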

Follow-up to a418360 (the round-2 audit fixes):

- budget.go ForProfile docstring still claimed the headline window was 60s/15s; corrected to 40s/12s to match HeadlineWeekly() and mage_tier.go.
- Reword the RatedScenarios comment to reflect auto-mix-111 was deleted for good (not 'auto-mix-111 ... re-add once registered'), per owner decision that the scenario is gone. The dead entry was already removed from the rated pass in a418360.

…val)

FumingPower3925 added 11 commits June 21, 2026 19:06

deps: pin celeris v1.5.4 + loadgen v1.4.10 across adapter + refapps

dc64529

lint(scenarios): drop unused chainScenarioName (dead since chain remo…

4a75f15

…val)

FumingPower3925 merged commit 0c36503 into main Jun 23, 2026
11 checks passed

FumingPower3925 merged 11 commits into main from feat/v1.5.4-bench-redesign
